
convert add_limit to pipe step based limiting #2131

Merged: 16 commits into devel on Dec 16, 2024
Conversation

@sh-rp (Collaborator) commented Dec 10, 2024

Description

Up until now we were managing limits inside a somewhat conflated function that was wrapping generators. The problem was that limits were applied before incrementals were, and that we had a certain amount of code duplication with respect to wrapping async iterators. This PR solves both issues.
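
For context, a minimal usage sketch of add_limit on a resource (the resource and pipeline names below are made up for illustration):

```py
import dlt


@dlt.resource
def numbers():
    # hypothetical endless generator, used only for illustration
    i = 0
    while True:
        yield {"n": i}
        i += 1


# with this PR, the limit is a regular pipe step applied after incremental,
# instead of a wrapper around the generator itself
pipeline = dlt.pipeline(pipeline_name="limit_demo", destination="duckdb", dev_mode=True)
pipeline.run(numbers().add_limit(10))
```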

@sh-rp sh-rp marked this pull request as ready for review December 10, 2024 15:47

netlify bot commented Dec 10, 2024

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: fc013f5
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/676056ef7ec717000889d732

@sh-rp sh-rp linked an issue Dec 10, 2024 that may be closed by this pull request
@rudolfix (Collaborator) left a comment

This looks so simple now and I think the implementation is right. Two things:

  • we have a test failing in a curious way: apparently we call the REST API twice even if the limit is 1. Why? We count items at the end of the pipe but there's just a single pipe. We must investigate.
  • by counting at the end of the pipe we change the behavior. I think it makes sense... but maybe we can add a sticky flag as an argument to add_limit, so people can still stick it to the gen object and count unfiltered items as before?

dlt/extract/items.py (outdated, resolved)
```py
        return self

    def __call__(self, item: TDataItems, meta: Any = None) -> Optional[TDataItems]:
        if self.count == self.max_items:
```
Collaborator:

I'm not sure that is enough. We should close the gen when we reach max items, but in this implementation we close the gen at the end of the pipe. Not all steps are sync steps; there are, for example, steps that yield (or maybe even async steps, I don't remember). We should still return None when count > max_items (see the sketch after the list below).

I think we need to add more tests:

  • what happens if we do add_yield_map?
  • are we testing limit for async generators / iterators?
  • any differences for round robin / fifo?
  • make sure that all expected items are pushed to the transformer (this happens via the special ForkStep)
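
A minimal sketch of the suggested guard, reusing the signature and attribute names from the snippet above; closing the generator itself is left out, and the actual implementation may well differ:

```py
def __call__(self, item: TDataItems, meta: Any = None) -> Optional[TDataItems]:
    if self.count >= self.max_items:
        # anything produced after the limit was reached is dropped, even if
        # the wrapped generator has not been closed yet
        return None
    self.count += 1
    return item
```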

Collaborator Author:

I think this is all tested now, except for round robin and fifo, but I am quite sure that this will not make a difference, since rr and fifo only apply at the get_source_item level and there is no async stuff going on in add_limit (it's all taken care of already in other places).

```diff
 if validator:
-    self.add_step(validator, insert_at=step_no if step_no >= 0 else None)
+    self.add_step(validator)
```
Collaborator Author:

NOTE: I removed inserting at the same position in favor of automatic resolution via placement affinity. I think this makes more sense; I can revert to the old behavior though.

Collaborator:

IMO we should put it exactly at the previous place. Users may want to transform/filter data items before or after this step, and that must be preserved.

```py
class LimitItem(ItemTransform[TDataItem]):
    placement_affinity: ClassVar[float] = 1.1  # stick to end, right behind incremental

    def __init__(
```
Collaborator Author:

I have moved the time limit and rate limiting over from the other PR. I think the time limit is fairly uncontroversial; the rate limiting is a little bit sketchy. I think it would be really cool to implement it with this PR, but we could also add a kind of global rate limiting at the PipeIterator level that gets handed over from the DltSource to the PipeIterator and is applied in the _get_source_item function, so that a new item is only extracted from any pipe if a minimum amount of time has passed.
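
A rough illustration of that global rate limiting idea (purely hypothetical, not part of this PR; class and method names are made up) — a throttle that PipeIterator could consult before pulling the next item from any pipe:

```py
import time


class MinIntervalThrottle:
    """Hypothetical helper: blocks until at least min_interval seconds
    have passed since the previous item was extracted."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```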

Collaborator Author:

Update: only keeping the time limit here, which is very straightforward to implement and, I think, pretty useful.
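
For reference, a minimal sketch of how the time limit could be used (assuming a max_time argument on add_limit, as discussed here; the resource is made up for illustration):

```py
import itertools
import time

import dlt


@dlt.resource
def ticker():
    # hypothetical slow, endless resource used only for illustration
    for i in itertools.count():
        time.sleep(0.1)
        yield {"n": i}


# stop extracting from this resource after roughly 5 seconds
pipeline = dlt.pipeline(pipeline_name="time_limited", destination="duckdb", dev_mode=True)
pipeline.run(ticker().add_limit(max_time=5))
```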

@joscha (Contributor) commented Dec 12, 2024

related to #2142

@joscha (Contributor) commented Dec 13, 2024

closes #2142

@rudolfix (Collaborator) left a comment

Still missing a few things:

  1. an actual test where limit is used with incremental to do the backfilling. My take is to add this to the sql_database tests (remember to add row_order)
  2. having this example, I'd add it to the performance guide and explain how to split large backfills, esp. mentioning that records should be ordered and that data should not be retaken twice (i.e. via a WHERE clause)
  3. on Py 3.11 some tests are not passing consistently. Pls take a look

```diff
 if validator:
-    self.add_step(validator, insert_at=step_no if step_no >= 0 else None)
+    self.add_step(validator)
```
Collaborator:

IMO we should put it exactly at the previous place. Users may want to transform/filter data items before or after this step, and that must be preserved.

dlt/extract/items.py (outdated, resolved)
@sh-rp sh-rp force-pushed the feat/make_limit_a_step branch from 3fc4d90 to f109a87 on December 15, 2024 21:04
@sh-rp (Collaborator Author) commented Dec 15, 2024

> Still missing a few things:
>
> 1. an actual test where limit is used with incremental to do the backfilling. My take is to add this to the sql_database tests (remember to add row_order)
> 2. having this example, I'd add it to the performance guide and explain how to split large backfills, esp. mentioning that records should be ordered and that data should not be retaken twice (i.e. via a WHERE clause)
> 3. on Py 3.11 some tests are not passing consistently. Pls take a look

I have added a general test that combines incremental and add_limit. I will also add a nice example using the rfam database, I think, but will have to do this tomorrow.
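
For reference, a rough sketch of the batched-backfill pattern under discussion — ordered records, an incremental cursor, and add_limit to cap each run (the resource, cursor field, and fetch helper are made up; only the incremental + add_limit combination is the point):

```py
import dlt


def fetch_events_since(last_value):
    # hypothetical stand-in for an ordered query such as
    #   SELECT * FROM events WHERE updated_at > :last_value ORDER BY updated_at
    return [{"id": 1, "updated_at": "2024-01-01T00:00:00"}]


@dlt.resource
def events(
    updated_at=dlt.sources.incremental("updated_at", initial_value="1970-01-01T00:00:00")
):
    yield from fetch_events_since(updated_at.last_value)


pipeline = dlt.pipeline(pipeline_name="backfill", destination="duckdb")
# each run picks up where the previous one stopped and loads at most 1000 items;
# rerun until the backfill catches up
pipeline.run(events().add_limit(1000))
```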

@sh-rp sh-rp requested a review from rudolfix December 15, 2024 21:07
@rudolfix (Collaborator) left a comment

I'm not sure we should always wrap iterators. See my comment.

dlt/extract/utils.py (resolved)
tests/extract/test_incremental.py (resolved)
@sh-rp (Collaborator Author) commented Dec 16, 2024

@rudolfix I have delayed the iterator wrapping to the LimitItem binding now; I agree that this is probably a good idea. That said, the typing/importing is a bit messy now IMHO.


```py
resource.add_limit(10)

p = dlt.pipeline(pipeline_name="incremtal_limit", destination="duckdb", dev_mode=True)
```
Contributor:

Suggested change:
```diff
-p = dlt.pipeline(pipeline_name="incremtal_limit", destination="duckdb", dev_mode=True)
+p = dlt.pipeline(pipeline_name="incremental_limit", destination="duckdb", dev_mode=True)
```

@sh-rp sh-rp linked an issue Dec 16, 2024 that may be closed by this pull request
@rudolfix (Collaborator) left a comment

LGTM!

@sh-rp sh-rp merged commit 268768f into devel Dec 16, 2024
58 of 59 checks passed
@rudolfix rudolfix deleted the feat/make_limit_a_step branch December 19, 2024 14:48
Successfully merging this pull request may close these issues.

  • Make batched loading more convenient
  • Filesystem Source incremental loading with S3 not working correctly
3 participants